class: center, middle, inverse, title-slide # MDaS ### John Tipton ### 8/20/2021 --- # MDaS 2021 Welcome: - A little about me --- # What is data science? -- <img src="data:image/png;base64,#./images/frequentists_vs_bayesians_2x.png" width="1248" /> --- # What is data science? - Data - Statistics - Computation - Interpretation/communication --- # What is my experience in data science - Collaborative - Dynamic - Communication and interpersonal skills --- # Modeling In general, all statistical models (AI/ML/etc. -- choose your favorite term) are some flavor of **regression!** $$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $$ ```r n <- 500 beta_0 <- 2 beta_1 <- 3 x <- seq(-1, 1, length = n) epsilon <- rnorm(n, 0, 0.25) y <- beta_0 + beta_1 * x + epsilon ``` --- # Modeling <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> --- # Modeling * If all we have is a linear model, all we can fit are linear functions * Is this a good model fit? <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> --- # Modeling * How do we make that model better? ## Modeling * Quadratic term? $$ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i $$ --- # Modeling * In general, statistical modeling can be written as $$ y_i = f(x_i) + \epsilon_i $$ .pull-left[ * Goal: find the function `\(f(\cdot)\)` that best approximates the observed data `\(y_i\)` given inputs `\(x_i\)` * Example: let `\(x_i\)` be an image, `\(y_i\)` is the label cat, `\(f(\cdot) = ???\)` ] .pull-right[  ] --- # Modeling * The goal is to model `\(f(\cdot)\)` -- * What ways can we improve the model `\(f(\cdot)?\)` -- * Break up `\(f(\cdot)\)` into small pieces -- * Called a "basis" for `\(f(\cdot)\)` * 1 and `\(x\)` are a basis for linear functions * 1, `\(x\)`, and `\(x^2\)` are a basis for quadratic functions --- # Modeling * Is this basis of step functions a better fit? .pull-left[ <!-- --> ] .pull-right[ <!-- --> ] --- # Modeling * What about this basis of "smooth" functions .pull-left[ <!-- --> ] .pull-right[ <!-- --> ] --- # Projects - Climate from tree rings - Mineral formation and sedimentation - Cancer identification - Historical forest distributions --- # Climate proxy data * Many ecological and physical processes respond to climate over different time scales. * Tree rings, corals, forest landscapes, ice rings, lake levels, etc. * These processes are called **climate proxies**. * They are proxy measurements for unobserved climate. * Noisy and messy. * Respond to a wide variety of non-climatic signals. --- # PalEON <center> <img src="data:image/png;base64,#images/atlantic-1.png" alt="drawing" width="500"/> </center> \blfootnote{\tiny{\textit{http://www.theatlantic.com/science/archive/2016/03/the-worlds-most-urgent-science-project/474558}.}} --- # PalEON <center> <img src="data:image/png;base64,#images/paleon_logo.png" alt="drawing" width="500"/> <img src="data:image/png;base64,#images/atlantic-2.png" alt="drawing" width="500"/> </center> --- # Forward Model <div class="newspaper"> <center> <img src="data:image/png;base64,#images/model-forward.png" alt="drawing" width="500"/> </center> <hr style="height:20px; visibility:hidden;" /> <center><font size="+6">Data.</font></center> <hr style="height:80px; visibility:hidden;" /> <center><font size="+6">Data Model.</font></center> <hr style="height:80px; visibility:hidden;" /> <center><font size="+6">Process Model.</font></center> <hr style="height:20px; visibility:hidden;" /> </div> --- # Inverse Model <div class="newspaper"> <center> <img src="data:image/png;base64,#images/model-inverse.png" alt="drawing" width="500"/> </center> <hr style="height:20px; visibility:hidden;" /> <center><font size="+6">Data.</font></center> <hr style="height:80px; visibility:hidden;" /> <center><font size="+6">Data Model.</font></center> <hr style="height:80px; visibility:hidden;" /> <center><font size="+6">Process Model.</font></center> <hr style="height:20px; visibility:hidden;" /> </div> --- # The Data <center> <img src="data:image/png;base64,#images/introData.png" alt="drawing" width="1000"/> </center> --- # Data Model \begin{align*} [\mathbf{z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}]} [\mathbf{z} | \boldsymbol{\theta}_P] [\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*} --- # Climate Data Model \begin{align*} \begin{pmatrix} \mathbf{T}_{t} \\ \mathbf{P}_{t} \end{pmatrix} = \begin{pmatrix} \mathbf{T}_{\mbox{Jan}, t} \\ \vdots \\ \mathbf{T}_{\mbox{Dec}, t} \\ \mathbf{P}_{\mbox{Jan}, t} \\ \vdots \\ \mathbf{P}_{\mbox{Dec}, t} \\ \end{pmatrix} \sim \color{red}{\mbox{N} \left( \mathbf{A} \begin{pmatrix} \mathbf{T}_{t-1} \\ \mathbf{P}_{t-1} \end{pmatrix}, \boldsymbol{\Sigma} \right).} \end{align*} <hr style="height:20px; visibility:hidden;" /> * Temperature and precipitation in sequential months (years) are more related to each other than months (years) far apart. --- # Tree Ring Data Model \begin{align*} y_{i t j} & \sim \color{red}{\begin{cases} \mbox{N}\left(\beta_{0_j} + \beta_{1_j} f^{VS}\left(\mathbf{w}_t, \boldsymbol{\theta}^{VS}_j\right), \sigma^2_j \right) & \mbox{if } z_j = 0,\\ \mbox{N}\left(\tilde{\beta}_{0_j} + \tilde{\beta}_{1_j} f^{Pro}\left(\mathbf{w}_t, \boldsymbol{\theta}^{Pro}_j\right), \tilde{\sigma}^2_j \right) & \mbox{if } z_j = 1. \\ \end{cases}} \end{align*} <hr style="height:20px; visibility:hidden;" /> * Regress observed tree ring `\(y_{itj}\)` onto simulated tree rings `\(f^{VS}\left(\mathbf{w}_t, \boldsymbol{\theta}^{VS}_j\right)\)` and `\(f^{Pro}\left(\mathbf{w}_t, \boldsymbol{\theta}^{Pro}_j\right)\)`. <hr style="height:20px; visibility:hidden;" /> * `\(z_j\)` - species specific growth model form (VS or Pro). <hr style="height:20px; visibility:hidden;" /> * Model chooses best growth model form for each species. --- # Process Model \begin{align*} [\mathbf{z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto [\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}] \color{blue}{[\mathbf{z} | \boldsymbol{\theta}_P]}[\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*} --- # Tree Ring Growth Model <center> <img src="data:image/png;base64,#images/rampFunction.png" alt="drawing" width="500"/> </center> <hr style="height:20px; visibility:hidden;" /> \begin{align*} f^{\ell}\left(\mathbf{w}_t, \boldsymbol{\theta}^{\ell}_j\right) & = \color{blue}{ \sum_{s=1}^{12} \chi_s \mbox{min} \left( g^{\ell}\left( T_{t,s}, \boldsymbol{\theta}^{\ell}_j \right), g^{\ell}\left( P_{t, s}, \boldsymbol{\theta}^{\ell}_j \right) \right),} \\ & \ell = VS \mbox{ or } Pro. \end{align*} --- # Tree Ring Growth Model <video controls loop><source src="index_files/figure-html/unnamed-chunk-11.webm" /></video> --- # The Inverse Problem <video controls loop><source src="index_files/figure-html/unnamed-chunk-12.webm" /></video> --- # Prior Model \begin{align*} [\mathbf{z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto [\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}] [\mathbf{z} | \boldsymbol{\theta}_P]\color{orange}{[\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P]} \end{align*} --- # Prior model * Assumes growth responses follow "ecological niche." * Tree species that grow in the Hudson Valley respond to similar climate so have similar responses. * Variations from common response are to exploit an "ecological niche" that allows many species to exist on the same landscape. --- # Ecological Niche <center> <img src="data:image/png;base64,#images/growthCaterpillar.png" alt="drawing" width="900"/> </center> --- # Reconstruction <center> <img src="data:image/png;base64,#images/fitHudson1.png" alt="drawing" width="750"/> </center> # Reconstruction <center> <img src="data:image/png;base64,#images/fitHudson2.png" alt="drawing" width="750"/> </center> --- # Why is the temperature reconstruction poor? <center> <img src="data:image/png;base64,#images/temperatureParameters.png" alt="drawing" width="1000"/> </center> --- # Pollen Data <img src="data:image/png;base64,#./images/pollen.png" width="95%" /> --- # Pollen Data <img src="data:image/png;base64,#./images/pollen-locations.png" width="95%" /> --- # Fossil Pollen Data <img src="data:image/png;base64,#./images/fossil-pollen-locations.png" width="85%" /> --- # Data Model \begin{align*} [\mathbf{z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}]} [\mathbf{z} | \boldsymbol{\theta}_P] [\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*} --- # Data Model \begin{align*} \mathbf{y}\left( \mathbf{s}_i, t \right) & \sim \color{red}{\operatorname{Dirichlet-Multinomial} \left( N, \exp \left(\mathbf{z}\left( \mathbf{s}_i, t \right) \boldsymbol{\beta} \right)\right)} \end{align*} <hr style="height:20px; visibility:hidden;" /> * Researchers take 1cm$^3$ cubes sediment samples along the length of a sediment core from a lake. <hr style="height:20px; visibility:hidden;" /> * Raw data are counts of each species `\(y(\mathbf{s}_i, t)\)` at site `\(\mathbf{s}_i\)` for time `\(t\)`. <hr style="height:20px; visibility:hidden;" /> * In each cube, researcher counts the first `\(N\)` pollen grains and identifies to species. <hr style="height:20px; visibility:hidden;" /> * Climate variable `\(\mathbf{z}(\mathbf{s}_i, t)\)` at site `\(\mathbf{s}_i\)` for time `\(t\)`. * **Only known for time `\(t=1\)`**. --- # Non-linear Data Model * Vegetation response to climate is non-linear. * Pollen are "aggregated" into groups across space and taxonomy. * "Niche" responses in the groups can produce multi-modal responses. <hr style="height:8px; visibility:hidden;" /> \begin{align*} \mathbf{y}\left( \mathbf{s}_i, t \right) & \sim \color{red}{\operatorname{Dirichlet-Multinomial} \left( N, \exp\left( \mathbf{B} \left( \mathbf{z}\left( \mathbf{s}_i, t \right) \right) \boldsymbol{\beta} \right) \right)} \end{align*} <hr style="height:8px; visibility:hidden;" /> * `\(\mathbf{B} \left( \mathbf{z}\left( \mathbf{s}_i, t \right) \right)\)` is a basis expansion of the covariates `\(\mathbf{z}\left( \mathbf{s}_i, t \right)\)`. * Use B-splines or Gaussian Processes as a basis. * `\(\mathbf{B} \left( \mathbf{z}\left( \mathbf{s}_i, t \right) \right)\)` is random. * Computationally challenging. <br /> * For `\(t \neq 1\)`, the `\(\mathbf{z} \left( \mathbf{s}_i, t \right)\)`s are unobserved. <br /> --- # Non-linear Calibration Model <img src="data:image/png;base64,#./images/alps-functional-fit-all-models-subset.png" width="95%" /> --- # Process Model \begin{align*} [\mathbf{z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto [\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{z}] \color{blue}{[\mathbf{z} | \boldsymbol{\theta}_P]}[\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*} --- # Process Model | Dynamic Model * We are interested in estimating the latent process `\(\mathbf{z} \left( \mathbf{s}, t \right)\)`. \begin{align*} \color{blue}{\mathbf{z} \left(t \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma}} & \sim \color{blue}{\operatorname{N}\left(\mathbf{M}\left(t\right) \left( \mathbf{z} \left(t-1 \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma} \right), \boldsymbol{R}\left( \boldsymbol{\theta} \right) \right)} \end{align*} <hr style="height:20px; visibility:hidden;" /> * Assumes climate states nearby in time are more correlated than those further apart in time. <hr style="height:20px; visibility:hidden;" /> * `\(\mathbf{X} \left(t \right) \boldsymbol{\gamma}\)` are the fixed effects from covariates like latitude, elevation, etc. --- # Elevation covariates <img src="data:image/png;base64,#./images/elevation.png" width="95%" /> --- # Scaling the process for big data * Define a set of spatial knot locations `\(\mathbf{s}^{\star} = \left\{ \mathbf{s}_1^{\star}, \ldots, \mathbf{s}_m^{\star} \right\}\)`. <hr style="height:8px; visibility:hidden;" /> * `\(\boldsymbol{\eta}^{\star} \left( t \right) \sim \operatorname{N} \left( \mathbf{0}, \mathbf{R}^{\star}\left( \boldsymbol{\theta} \right) \right)\)`. <hr style="height:8px; visibility:hidden;" /> * `\(\mathbf{R}^{\star}\left( \boldsymbol{\theta} \right)\)` is the spatial covariance defined at the knot locations `\(\mathbf{s}^{\star}\)`. <hr style="height:8px; visibility:hidden;" /> * The linear interpolator from observed location `\(\mathbf{s}_i\)` to knot location `\(\mathbf{s}_j^{\star}\)` is `\(\mathbf{r} \left(\mathbf{s}_i, \mathbf{s}_j^{\star} \right) \mathbf{R}^{\star}\left( \boldsymbol{\theta} \right)^{-1}\)` where $\mathbf{r} \left(\mathbf{s}_i, \mathbf{s}_j^{\star} \right) = \operatorname{Cov} \left(\mathbf{s}_i, \mathbf{s}_j^{\star} \right)$ <hr style="height:8px; visibility:hidden;" /> --- # Predictive Process * `\(\boldsymbol{\eta} \left( t \right) \approx \mathbf{r} \left(\mathbf{s}, \mathbf{s}^{\star} \right) \mathbf{R}^{\star}\left( \boldsymbol{\theta} \right)^{-1} \tilde{\boldsymbol{\eta}} \left( t \right)\)`. * The predictive process can be shown to be the optimal predictor of the parent process `\(\boldsymbol{\eta} \left( t \right)\)` of dimension `\(m\)` * The dynamic climate process becomes \begin{align*} \color{blue}{\mathbf{z} \left(t \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma}} & \approx \color{blue}{\mathbf{M}\left(t\right) \left( \mathbf{z} \left(t-1 \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma} \right) + \mathbf{r} \left(\mathbf{s}, \mathbf{s}^{\star} \right) \mathbf{R}^{\star}\left( \boldsymbol{\theta} \right)^{-1} \boldsymbol{\tilde{\eta}} \left(t \right)} \end{align*} --- # Time Uncertainty * Each fossil pollen observation includes estimates of time uncertainty. * The time of the observation is uncertain. * Weight the likelihoods according to age-depth model. * Posterior distribution of ages. <br /> * For each observation fossil pollen observation an age-depth model gives a posterior distribution over dates. * Define `\(\omega \left(\mathbf{s}_i, t \right)\)` as P(age `\(\in (t-1, t)\)`). * `\([\mathbf{y} \left( \mathbf{s}_i, t \right) | \boldsymbol{\alpha} \left( \mathbf{s}_i, t \right) ] = \prod_{t=1}^T [\mathbf{y} \left( \mathbf{s}_i, t \right) | \boldsymbol{\alpha} \left( \mathbf{s}, t \right)]^{\omega_\left(\mathbf{s}_i, t \right)}\)`. --- # gitHub package * Non-spatial code available as a [gitHub package](https://github.com/jtipton25/BayesComposition). ```r devtools::install_github("jtipton25/BayesComposition") ``` <br /> * Includes options for multiple models including: * mixture models. * different likelihoods and link functions. * correlations in functional response. <br /> * Code in `C++` using `Rcpp` package for computation speed. --- # Reconstruction over time <img src="data:image/png;base64,#./images/alps-spatially-averaged-predictions.png" width="100%" /> --- # Reconstruction over time <img src="data:image/png;base64,#./images/alps-predictions-anomalies-mean-bspline.png" width="33%" /><img src="data:image/png;base64,#./images/alps-predictions-mean-bspline.png" width="33%" /><img src="data:image/png;base64,#./images/alps-predictions-sd-bspline.png" width="33%" /> --- # Reconstruction Inference * Current methods are site-level "transfer function" methods. * These methods ignore elevation, temporal autocorrelation, and spatial autocorrelation. * Sensitive to the data. * Poor quantification of uncertainty. * Unclear how to choose among models. * The spatial method is statistically principled. * Has higher power. * Smaller uncertainties that change with data (sample size, signal coherence, etc.). * Can use model selection methods (information criterion, etc). --- # Reconstruction Inference ```r # knitr::include_graphics(here::here("images", "Science.png")) knitr::include_graphics("./images/site-level-inference-reduced.png") ``` <img src="data:image/png;base64,#./images/site-level-inference-reduced.png" width="100%" /> --- # Conclusion **It is possible to put the science in your statistics.** * Takes some careful thinking and learning. * Opens the door to more powerful analyses. * More flexibility in the questions that can be answered.